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i 1 Abstract: 

Objective: Assessing the statistical power to detect suscepti- 
"^h bility variants plays a critical role in GWA studies both from 
+^ the prospective and retrospective points of view. Power is em- 
-4_> pirically estimated by simulating phenotypes under a disease 
i ^ . model HI. For this purpose, the "gold" standard consists in si- 
mulating genotypes given the phenotypes (e.g. Hapgen). We 
1— 1 introduce here an alternative approach for simulating pheno- 
\Jr*> tyP es under HI that does not require generating new geno- 
„_j- types for each simulation. Methods: In order to simulate phe- 
notypes with a fixed total number of cases and under a given 
disease model, we suggest three algorithms: i) a simple re- 
jection algorithm; ii) a numerical Markov Chain Monte-Carlo 
(MCMC) approach; iii) and an exact and efficient backward 
CN sampling algorithm. In our study, we validated the three al- 
gorithms both on a toy-dataset and by comparing them with 
;> Hapgen on a more realistic dataset. As an application, we 
J^j then conducted a simulation study on a 1000 Genomes Project 
r> dataset consisting of 629 individuals (314 cases) and 8,048 
SNPs from Chromosome X. We arbitrarily defined an additive 
disease model with two susceptibility SNPs and an epistatic 
effect. Results: The three algorithms are consistent, but back- 
ward sampling is dramatically faster than the other two. Our 
approach also gives consistent results with Hapgen. Using our 
application data, we showed that our limited design requires 
a biological a priori to limit the investigated region. We also 
proved that epistatic effects can play a significant role even 
when simple marker statistics (e.g. trend) are used. We fi- 
nally showed that the overall performance of a GWA study 
strongly depends on the prevalence of the disease: the larger 
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the prevalence, the better the power. Conclusions: Our ap- 
proach is a valid alternative to Hapgen-type methods; it is not 
only dramatically faster but also has two main advantages: 1) 
there is no need for sophisticated genotype models (e.g. ha- 
plotype frequencies, or recombination rates); 2) the choice 
of the disease model is completely unconstrained (number 
of SNPs involved, Gene-Environment interactions, hybrid ge- 
netic models, etc.). Our three algorithms are available in an 
R package called waf f ect ("double-u affect", for weighted af- 
fectations) . 

Introduction 

Genome-wide association (GWA) studies are a widely-used 
approach for localizing susceptibility variants responsible for 
common complex genetic diseases flTJ . Such studies involve 
investigating a huge number of genetics markers such as sin- 
gle nucleotide polymorphisms - SNPs (from hundreds of thou- 
sands to millions), for cohorts of cases and controls whose 
sizes range from thousands to tens of thousands of individu- 
als. GWA studies have met with many successes, most notably 
for type 1 and type 2 diabetes, inflammatory bowel disease, 
prostate cancer and breast cancer . 

In GWA studies, very high false positive rates are expected 
when there is no correction for multiple testing. Symmetri- 
cally, control for true negative rate - or power - is necessary. 
Power estimation is the key to evaluate the efficiency of GWA 
methods [3 |. The correct estimation of both rates must take 
into account the existence of high-dependency patterns be- 
tween SNPs, or linkage disequilibrium (LD) . The accurate es- 
timation of the family wise type I error risk in the presence 
of LD consists in sampling the HO distribution through per- 
mutations of phenotypes [4]. Thus, any association between 
loci and phenotypes is broken. This permutation strategy is 
implemented as a gold standard in numerous dedicated pack- 
ages, together with software suites designed for GWA studies 

EJEKZHSHU. 

Power is an even more complicated function of several fac- 
tors: study design, correlation patterns in the genotypic data, 
sizes of cohorts, frequency of the susceptibility allele, relative 
risk conferred by the causal factor, genetic model (additive, 
dominant, recessive, multiplicative) H10L As a consequence, 
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the analytical computation of power requires simplified as- 
sumptions, including the approximation of the test statistic 
distribution under HI through a probability law [11]. Most 
power calculators based on analytical approaches are used for 
two-stage GWA design, e.g. H121I13H . Recently, an analytical 
approach has been proposed to account for LD, under either 
HO or HI approximation fl 1411 : a fixed-size sliding window lo- 
cally accounts for the inter-marker correlation. This approach 
brings an improvement over block-wise strategies by unifying 
HO and HI processings in the same framework H151ll6l[T7ll . 
However, with regards to both accuracy and computational 
burden, the optimal choice of window size depends on the 
structure of the data. Moreover, LD blocks are often am- 
biguous. Thus, the previous sliding window approach cannot 
account for high-order dependencies between LD blocks. In 
particular, this method cannot be used to evaluate the power 
of any novel approach designed to cope with such high order 
dependencies in GWA studies. In this case, the only solution 
is to use computationally intensive simulations. 

Similar to sampling under HO, the simulation of the HI 
distribution is an appealing way to keep the LD-structured 
genotypic data. These simulations consist in generating case 
and control samples which mimic the LD structure in hu- 
man genome, i.e. in the creation of, say, k datasets under 
the HI assumption (at least one SNP is a susceptibility SNP). 
Nonetheless, breaking any association between a locus (or se- 
veral loci) and the phenotype is far easier than to introduce 
such an association in a dataset. 

Two main strategies have been developed to simulate HI. 
(i) The first one [11811 consists in first generating a large sam- 
ple of haplotypes conditional on reference haplotypes such 
as HapMap haplotypes [19]. The haplotypes are then paired 
to build diplotypes. The disease status is affected depending 
on the penetrance model which involves a susceptibility SNP 
selected at random, (ii) The second strategy [3] consists in 
first randomly selecting a disease SNP and generating a fixed 
number of cases and controls. Then diplotypes are assigned 
at the disease SNP for cases and controls depending on the 
penetrance model. Finally, haplotypes are built (two for each 
diplotype) for all remaining SNPs of the chromosome, condi- 
tional on reference haplotypes. Nevertheless, both strategies 
entail problems of implementation when applied to power es- 
timation in GWA studies. The drawback of the first strategy 
is that it does not make it possible to control the number of 
cases and controls. To tackle this issue, rejection sampling of 
case-control samples can be used, but this leads to a waste of 
time and data. An illustration of the first strategy is Fregene 
HI 811 . The second strategy makes it possible to control these 
numbers by first fixing them, but requires generating haplo- 
types for each simulation. The widely-used simulator Hapgen 
H201 implements the second strategy. 

To this aim, we propose a new method which randomly as- 
signs m cases and n controls conditional on n = n + m 
genotypes and a given disease model. Our idea is to affect 



the status of individual i according to both the probability 7T; 
that i is a case and to the constraint that the total number 
of cases is fixed. The probability 7Tj is computed from the 
disease model and is proportional to the relative risk for in- 
dividual i. This method provides several advantages. First, it 
generalizes permutations, which is the gold standard for ge- 
nerating HO distributions in order to control type I error in 
multiple testing. Indeed, permutations represent a particular 
case of our general approach which is obtained by using the 
same m for all individuals. Secondly, our method is faster 
than previous ones J3l 12 111 because it does not require a re- 
jection sampling step or the generation of new genotypes for 
each simulation. Moreover, in the case of LD modeling-based 
mapping methods, LD pattern identification needs to be per- 
formed only once. Thirdly, no assumption is made on the 
genotype distribution, because the latter is the same for each 
simulation. The last advantage is that power can be directly 
assessed using real GWA datasets, such as those provided by 
the 1000 Genomes Project [22), because disease status can 
be generated according to genotypes without loss due to re- 
jection sampling. This is most valuable when evaluating the 
performance of new mapping methods. 

The paper is organized as follows. We first describe the 
method itself, which we called waf f ect, then we validate our 
tool on a toy-example and compare its results to Hapgen H201 
simulations. We finally illustrate the interest of our approach 
with a power study conducted on real GWA data. 

Materials and Methods 

Detailed aims and notations 

We consider a GWA study with n = n + m individuals (no 
controls and m cases), and p SNPs. The i th phenotype is Fj e 
{0, 1} (Yi — 1 if individual i is a case, Yi — if individual i is a 
control). The i th genotype is G {0, 1, 2} p (this corresponds 
to the number of rare alleles for each SNP) . 

Our disease model, thereafter denoted HI, is defined by the 
probability 71-j = P(Yj = l|Xj) that individual i is a case condi- 
tional on her/his genotype. In the particular case when all 71-j 
have the same non-negative value, the observed genotype has 
no effect on the phenotype. This case corresponds to the null 
hypothesis HO. In the particular case when we consider a sin- 
gle SNP (p = 1), n.i takes only three possible values: m = f Q 
if X % = 0, Tti = h = /0RR1 if X t = 1, and tt, = f 2 = / RR 2 if 
Xi — 2, where / , /1 and / 2 are the penetrances of the disease 
model, and RRi and RR 2 are the relative risks. 

It is obviously easy to simulate under HI using the inde- 
pendent Bernoulli distribution B(ni) for each individual. Un- 
fortunately, it is quite unlikely that when doing so the case- 
control design constraint C — ^ = n i} wm> be- ful- 
filled. 

For this reason, the standard strategy (i.e. Hapgen) con- 
sists in simulating genotypes conditional on phenotypes using 
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Bayes' formula: 

i .1 ^2 x ^ Yi \x i = x)P(x i = x) p(y«) ' 

The advantage of this approach is that constraint C is au- 
tomatically fulfilled, whereas the major drawback is that a 
genotype distribution P(Xj) must be introduced. In order to 
be realistic, this imposes sophisticated modeling using haplo- 
type data and recombination rates. 

As an alternative, we suggest several ways to simulate phe- 
notypes under HI conditional on the genotypes (which there- 
fore remain untouched) while respecting constraint C. To do 
so, we introduce the notation Zj — J2i=i ^ (with Z = by 
convention) so that the constraint becomes C = {Z n = m}. 

Reject 

The first approach is straightforward: 1) draw (li)i=i... n 
using independent Bernoulli distributions; 2) if constraint C 
is fulfilled, retain {Yi)i = \... n as a valid sample; if not, simply 
reject it and return to 1). The problem with this algorithm 
is that samples will be rejected at rate 1 — P(C). The natural 
question is then: how (un-)likely is event C? 

P(C) can be computed recursively by introducing the for- 
ward quantities: Fi(m) — P(Zj = m). Starting with F (m) 
being equal to 0, except for F (0) = 1, we get the following 
recursive formula for 1 ^ i ^ n: 

Fi{m) = Fi-tim - 1)^ + F,_ 1 (m)(l - m). (1) 

It follows that the probability of the constraint is P(C) = 

p(Z n = m) = F n ( ni ). 

This approach is easy to understand and implement, howe- 
ver it usually has a very low rate of success. This is the case 
when the design is large and the itiS are low (for instance in 
the case of a small prevalence). In practice, its use will be 
limited to small toy-examples. 

MCMC 

In order to overcome this drawback, an interesting idea is 
to turn to Markov Chain Monte Carlo (MCMC) techniques 
such as the Metropolis-Hastings class of sampling algorithms 
H23L Starting from a configuration of phenotypes fulfilling 
the constraint (e.g. with the observed phenotypes), alternate 
the following two steps: 1) proposal move: select two individ- 
uals i and j such that Yi = 1 and Yj = and then exchange 
their phenotypes (Yi = and Yj = 1); 2) accept this move 
with rate a given by: 

= (1 - 7ri)7Tj 
7T<(1 - TTj)' 

It can be shown that the sequence of phenotype configura- 
tions generated by this algorithm is a Markov chain whose 
stationary distribution is our targeted constrained distribu- 
tion HI. In practice, the first configurations which are ge- 
nerated must be discarded since the Markov chain is not yet 



stationary (this stage is called burn-in). In order to obtain a 
set of independent configurations under HI, one can either 
repeat the burn-in phase of the algorithm as many times as 
necessary, or use a sequence of pseudo-independent samples 
by picking phenotype configurations after the burn-in once 
every fixed number of iterations. 

The main advantage of the MCMC approach is that it will 
eventually work regardless of the probability of event C. Its 
drawback is that the number of iterations for burn-in and 
pseudo-independence necessary for good mixing increases at 
least linearly with the total number n of individuals. More- 
over, the control for stationarity and pseudo-independence is 
delicate and the practical choice of burn- in and independence 
thresholds usually requires tedious calibration work. 

Backward sampling 

Our proposed alternative is to turn to exact sampling by 
introducing the backward quantities Bi(m) = F(C\Zi = m) 
which can be computed recursively like the forward quanti- 
ties defined above. 

Starting from B n (m) being equal to 0, except for B n (ni) = 
1, we get the following recursive formula for 1 ^ i ^ n: 

Bi-x{m) = mBiim + 1) + (1 - n)Bi(m). (2) 

One should note that we obtain P(C) = B o (0) in the process. 

Then we can sample phenotypes Yi sequentially with the 
following probability: 

P(y i = l|^_ 1 = m,C) = ^+il. (3) 

Bi-i(m) 

Since this expression depends on the corresponding 

probability obviously depends on the number of cases seen 
by i — 1. In mathematical terms, the sequence (Yi, Zi)i—±... n 
is a heterogeneous Markov chain. 

HO simulations 

HO simulations are simply performed by permutating 
the phenotypes l^s. A simple way to do this is to uni- 
formly choose a permutation a of {1, . . . ,n} by performing 
a 0(n log n) quicksort algorithm on a sample of n indepen- 
dent random variables with a uniform distribution on [0, 1]. 
For example, this can be achieved through the sample com- 
mand in R [24]. Alternatively, one may use waf f ect with the 
same probability m for all individuals. 

Toy-example dataset 

For validation purposes, we first considered a toy-example 
dataset with p = 1 SNP. The n genotypes Xi have the follo- 
wing distribution: 80% have genotype 0, 15% genotype 1, 
and 5% genotype 2. With n being a multiple of 20, these 
counts are always integers. For our disease model, we con- 
sidered the following relative risks: RRi =1.5 and RR 2 = 2.0 
(additive effect of 0.5). We hence obtained the following pen- 
etrances: f , fi = 1.5/o and f 2 = 2.0/q. In order to explore 
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Table 1: Running time for generating 100 replicates. 



the relevance and limits of each of our three approaches, we 
considered a wide range of values for n and f . 

Hapgen simulations 

In contrast with simulations under HI, simulations under 
HO are expected to provide few statistically significant results. 
This contrast can be observed through a binary (or two-class) 
classifier system. Regarding this aspect, we compared the be- 
havior of waf f ect with the one of the widely used simulator 
Hapgen [12511 under various conditions. The comparison re- 
lied on the so-called Receiver Operating Characteristic (ROC) 
curve, which is a graphical plot of the true positive classi- 
fication rate (sensitivity) as a function of the false positive 
rate (specificity), for a continuum of values of a given dis- 
crimination criterion. The performance of each simulator is 
summarized using the area under the ROC curve (AUC) : the 
larger the AUC, the more performant the simulator in gene- 
rating HI data contrasting with HO data (more details about 
ROC curves can be found below). 

Three input parameters must be specified in Hapgen: i) the 
range of the Minor Allele Frequency (MAF) of the simulated 
susceptibility SNP; here we considered [0.2,0.3] or [0.1,0.2]; 
ii) the disease prevalence, which we set to 0.01; iii) and the 
severity of the disease expressed through the relative risks 
RRi = /i //o for individuals heterozygous at the disease SNP, 
and RR 2 = /2//0 f° r homozygous individuals (rare allele). 
We considered a total of n — 629 subjects (n = 315 con- 
trols, ni = 314 cases) and p = 9, 579 SNPs in the region 
delimited by loci 558,390 and 13,930,000 on chromosome 
1 (Reference: HapMap files [19], CEU population reference, 
http : / /hapm ap . ncbi . nlm . nih . gov/] ) ■ 

For each condition, the penetrances f , fi and / 2 used 
with waf feet for comparison purposes were obtained per- 
forming several simulations using Hapgen and then averag- 
ing the corresponding penetrances. Four datasets were gene- 
rated. Each was then submitted to a genome-wide association 
study In the first dataset (Hapgen - HI hypothesis), Hapgen 
provided the genotypes together with the phenotypes. All 
the other datasets consisted in the aforesaid genotypic data 



together with phenotypes generated as follows. In the sec- 
ond dataset (Hapgen - HO hypothesis), the phenotypes un- 
der the HO hypothesis were generated through the method of 
permutations. In the third dataset (waffect - HI hypothe- 
sis) the phenotypes were affected with waffect . Finally, in 
the fourth dataset (waffect - HO hypothesis), we obtained 
the replicates by affecting the phenotypes after specifying in 
waffect a uniform probability m accross all individuals (this 
is equivalent to permuting the phenotypes). Besides, 1,000 
replicates were simulated for each of the four datasets and 
under each condition (MAF range and relative risks). In the 
end, 64,000 GWA studies were performed (2 MAF ranges x 
8 (RRi, RR2) pairs x 4,000 datasets). Moreover, the com- 
parison of the AUCs obtained for Hapgen and waffect were 
performed twice, because we used alternatively two distinct 
statistics of association as the discrimination criterion for the 
two-class classifier. 

1000 Genomes dataset 

The dataset consists of the genotypes of n = 629 individuals 
(?io = 315 controls, n\ = 314 cases) from the 1000 Genomes 
Project J22J. We focused on the first 100,000 SNPs on chro- 
mosome X. In the pretreatment stage, we filtered out all the 
SNPs with Minor Allele Frequency (MAF) less than or equal 
to 5%, ending up with n = 8, 048 SNPs. 

We arbitrarily considered a disease model with two inter- 
acting disease SNPs. The two SNPs respectively have the 
base-pair positions 627,641 and 1,986,325, MAFs 0.26 and 
0.23, and display no linkage disequilibrium (r 2 = 0.02). We 
considered an additive effect (3 > and an epistatic effect rj 
such that: 

7Ti = / X \ 1.0 + PXf ifJ^=0 

[ 1.0 + + f3(Xf + X?) ifXlXf^O 

where X\ and Xf correspond to the genotype of individual i 
at the two susceptibility SNPs. 

Statistics of association 

The association signal of each SNP (single marker analy- 
sis) is computed through the PLINK software [26] using the 
trend statistic. We denote pj the p-value of SNP j. In order 
to avoid the multi-testing issue, we considered the following 
single real valued statistic: 

S p = max - log 10 pj 

where J p denotes the subset of SNPs such that the distance 
between their loci and the locus of a disease SNP is less than 
a radius p 

For example, corresponds to taking the best p-value pj 
in negative (decimal) log-scale on the whole dataset. With 
p = 5 kb, we consider the best p-value of the SNPs within 5 
kb of one of the two susceptibility SNPs. The radius part of 
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Table 2: AUC and 95% confidence intervals, 100 replicates. 



the statistic is somewhat unusual since it takes advantage of 
information which is rarely available in practice: the location 
of the disease SNPs. However, in our simulation framework, 
this information helped us discuss the power of the design 
depending on the a priori location of the susceptibility SNPs. 
From a general point of view, starting with no a priori leads to 
a choice of p = oo; after a successful linkage analysis, p = 100 
kb might be more relevant, and if a candidate gene has been 
identified, p = 5 kb could be a good choice. 
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Figure 1 : Comparison of the area under the curves (AUCs) for 
Hapgen and Waffect - 1,000 replicates under HI and 1,000 
replicates under HO, respectively, for MAF in [0.2,0.3], ad- 
ditive model, RRi = 1.6, RR 2 = 2.2. The 95% confidence 
intervals are indicated. 



Results 



ROC curves and AUC 

In order to measure the performance of a given statistic 
for a given design and HI model, it is natural to compare 
the sensitivity and specificity of the testing framework for a 
given threshold of significance. For example, a global thresh- 
old of 5% is commonly accepted as a reasonable choice. It 
might be interesting to avoid doing this arbitrary choice by 
studying instead the Receiver Operator Characteristic (ROC) 
curve which is nothing but a graphical representation of the 
specificities and sensitivities that we can obtain for all possi- 
ble values of the threshold of significance H27H • 

The ROC curve itself can then be further summarized by the 
Area Under the Curve (AUC) which can be qualitatively inter- 
preted as follows: AUC < 0.6 means "fail"; 0.6 < AUC 0.70 
means "poor"; 0.7 < AUC 0.80 means "fair"; 0.8 < AUC < 
0.9 means "good"; 0.9 < AUC s$ 1.0 means "excellent". 

The AUC can easily be estimated from two samples of the 
statistic: one sample under HO, and the other under HI. All 
ROC curves and AUC computations (including confidence in- 
tervals) were performed using the R package pROC [12811 . 



Validation: toy-example dataset 

In Table [T] are depicted the execution times for generating 
100 replicates through the rejection, MCMC and backward 
sampling algorithms on the toy-example dataset, together 
with the corresponding probabilities P(C) of randomly draw- 
ing a complete configuration of phenotypes with exactly n\ 
cases. Different values of n and / are considered. P(C) dra- 
matically decreases when n grows and f becomes smaller. As 
a consequence, the running time of the rejection algorithm is 
already important (more than 10 hours) even for n as small 
as 40 and / as large as 0.2. The rejection sampling algo- 
rithm was not run for greater values of n as the corresponding 
probability of a successful finding is negligible. The MCMC 
and backward sampling algorithms have short running times 
which grow with n (very gently for the backward algorithm), 
and, most importantly, which do not depend on P(C). For the 
MCMC algorithm, we considered a burn-in of 10 5 x n. 

The Area Under the Curves and the corresponding 95% 
confidence intervals computed on the toy-example dataset 
replicates are shown in Table [2] The values found with the 
three methods are consistent. 
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Figure 2: ROC curves and AUCs at different values p of the 
radius of the intervals centered in the two disease SNPs. 

Validation: Hapgen simulations 

First, it has to be noted that even if the number of SNPs 
specified in Hapgen is constant across the replicates, the ob- 
served number of SNPs varies across these replicates. More- 
over, the disease SNP's MAF and the relative risks observed 
in the simulated data fluctuate from one replicate to another 
(see the Supplementary Material for explanation) . 

Figure[T]illustrates the case when the MAF range is [0.2, 0.3] 
and the genetic model is additive with RR4 and RR 2 values 
respectively set to 1.6 and 2.2. Despite the fact that Hapgen 
and our approach are not based on the same exact model, we 
can see a very good correlation between the two approaches, 
suggesting that they give similar results. 

In our study (629 subjects, 9,579 SNPs specified in Hap- 
gen, Power Edge R900 XEON X7460 2.6 GHz, RAM 128 GB), 
the generation and processing of 1,000 replicates required 75 
minutes or so using the backward sampling algorithm (under 
HI or HO) or Hapgen under HO. In contrast, under HI with 
Hapgen, 110 hours or so were necessary to generate and pro- 
cess data with the same dimension. 

More comprehensive comparisons with Hapgen are avail- 
able in the Supplementary Material. 

Application: Power study on the 1000 Genomes dataset 

The ROC curves of the statistics S p on the 1 000 Genomes 
Project dataset for different values of the radius p are dis- 
played in Figure [2j The AUC is very low (95% CI = 
[0.41,0.57]) when considering all SNPs (p = 00). The AUC 
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Table 4: AUCs at different values 77 of epistasis. Fixed values: 
/o = 0.1, additive effect f3 = 0.3, radius of the candidate 
region p = 5 kb. 

increases as the region around the two susceptibility SNPs be- 
ing investigated gets smaller. In particular, the performance 
is good for p < 5 kb. This roughly corresponds to the size 
of a gene. These results show that given the design and the 
disease model under consideration, the statistic which con- 
sists in simply taking the best p-value is successful in detect- 
ing a positive signal only in presence of a biological a pri- 
ori, thereby narrowing the investigation to the candidate gene 
level. 

We fixed a candidate region corresponding to the radius 
p = 5 kb and investigated the dependency of the AUC on the 
following parameters: epistasis rj, additive effect ft, probabi- 
lity /o of being a case in absence of the rare alleles, and total 
number n of individuals. 

Not unexpectedly, the AUC grows with the additive effect (3 
(Table [3]). More surprisingly, it also grows when the epistatic 
effect increases (Table [4]) despite the fact that our statistic S p 
only uses simple marker statistics. 

We then focused on the design of the GWA study itself 
by studying the effect of the total number of individuals n 
on the AUC. This was simply done by doubling (tripling, 
quadrupling) the original dataset before applying our di- 
sease model. In the case of our configuration of reference 
1] = 0.3, (3 = 0.3, f = 0.1 and p = 5 kb, we observed that the 
performance is excellent when the population is twice as large 
as the original one (AUC = 0.94 [0.90, 0.97] for n = 1258, 
AUC = 0.80 [0.74,0.86] for n = 629). This gain in power is 
even more dramatic when all the SNPs are taken into account 
(p = 00) : the AUC doubles in value when passing from 629 to 
1258 individuals (see Table [5). 

Finally, we investigated the effect of f , the probability of 
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n 


AUC [95% CI] 


629 


0.49 [0.41,0.57] 


1258 


0.78 [0.71,0.84] 


1887 


0.92 [0.88,0.96] 


2516 


0.93 [0.90,0.97] 



Table 5: AUCs for different sample sizes and on the whole 
dataset. Fixed values: epistasis r\ = 0.3, additive effect /3 = 
0-3, h = 0.1. 



fo 


AUC [95% CI] 


0.001 


0.73 [0.66,0.80] 


0.01 


0.71 [0.64,0.78] 


0.1 


0.80 [0.74,0.86] 


0.25 


0.94 [0.90,0.97] 



Table 6: AUCs at different values of penetrance f . Fixed 
values: epistasis r\ = 0.3, additive effect f3 = 0.3, radius of the 
candidate region p = 5 kb. 

being affected without any exposure, in the design of the 
GWA study (note that for small values, fo is close to the preva- 
lence). We can see in Table [6] that small values of f result in 
smaller AUCs (i.e. AUC = 0.73 [0.66,0.80] for f = 0.001), 
while large values of / result in much better results (e.g. 
AUC = 0.94 [0.90, 0.97] for / = 0.25). 

Discussion 

Validation 

The three algorithms we suggested give similar results but 
their performances are quite different. The rejection algo- 
rithm has a complexity of 0(N/¥(C)) to obtain N samples. 
This prevents its use in practical cases when P(C) can easily 
reach 10~ 50 or worse. The MCMC approach has the advan- 
tage of removing this dependence, and is simple to imple- 
ment. The drawback is that the burn-in parameter requires 
some calibration work (we suggest to use burn-in = 10 5 n). 
Moreover this heavy numerical method is rather slow. In 
our study, the backward sampling approach proved to be the 
fastest one, dramatically outperforming the MCMC alterna- 
tive (i.e. 500 times faster). From a theoretical point of view, 
the backward sampling algorithm has a space complexity of 
0(ni x n), and a time complexity of 0{n\ xn + N xn) where 
N is the desired sample size. Our C implementation makes 
it possible to generate a configuration of n = 20, 000 phe- 
notypes with m = 10,000 cases in 0.2 seconds on a 1.86 Gz 
workstation (Xeon E5320, 8 Go RAM, Linux 2.6.35). 

Comparison with Hapgen 

Our comparison with Hapgen clearly demonstrates that 



our approach is a valid alternative to this software. It also 
sheds light on the major differences between the two ap- 
proaches. First, in order to yield a realistic model for simula- 
ting genotypes, Hapgen requires additional information such 
as HapMap frequencies and recombination rates whereas our 
approach only needs the original genotypes. Regarding our 
simulations performed with Hapgen and waffect, running 
the newest Hapgen version would have entailed no change in 
the results: the newest Hapgen version is an extension of the 
former; it is not an improvement (at least regarding single 
susceptibility SNP simulation). Indeed, in its initial version, 
Hapgen limits the disease model to only one susceptibility 
SNR The most recent version can now simulate multiple di- 
sease SNPs, under the assumption that each disease SNP acts 
independently and that all such SNPs are in Hardy -Weinberg 
equilibrium, which are two strong constraints. Besides, this 
novel software, unfortunately only available in the 64 bit ver- 
sion, requires the use of an R package to simulate interactions 
between the disease SNPs. In contrast, simulating any disease 
model is straightforward with our approach, as long as it re- 
sults in an individual probability for each individual to be af- 
fected. For example, we can consider two or more susceptibil- 
ity SNPs, and easily add epistatic effects, Gene-Environment 
interactions, the contribution of rare variants, etc. Moreover, 
due to the multiple constraints it must take into account, Hap- 
gen only respects the specified model on average. As a con- 
sequence, the effective MAF and relative risks can vary from 
one run to another. Besides, selecting for each Hapgen repli- 
cate a different disease SNP entails variations in the number 
of SNPs available for the association test (see the Supplemen- 
tary Material for more details). Finally, since our approach 
only generates a fraction of the data (the phenotypes), it is 
outstandingly faster than Hapgen. Furthermore, since with 
our approach the genotype remains untouched, any numeri- 
cally intensive analysis which is performed on the genotype 
table (g.e. LD computations) needs to be performed only once 
while with Hapgen it has to be done for each replicate. 

Effects 

In our application study, we showed how our strategy can 
be used to investigate various effects on the performance of a 
GWA study. We first pointed out that with the suggested de- 
sign, our modest disease model (/3 = rj = 0.3, fo = 0.1), and 
the analysis we chose (simple marker using PLINK), the sig- 
nal is only statistically detectable by limiting the region to be 
investigated to a 5 kb radius around the susceptibility SNPs. 
This clearly shows that the present design is practical only 
if previous work (e.g. linkage analysis) suggests candidate 
genes. 

Then we proved that the AUC increases when either the 
additive effect (/?) or the epistatic effect (rf) increases. While 
the result for the additive effect is not surprising, the result for 
the epistatic effect is worth commenting upon. Indeed, in our 
application, we deliberately stuck to a classical simple marker 
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analysis (PLINK using the trend test) which is not designed 
to detect epistatic effects. However, our study showed that 
the corresponding marginal effects can improve the overall 
statistical performance of the analysis even when no specific 
effort for the detection of epistasis is performed. It would be 
interesting to do further research to compare simple marker 
analysis to more complex analysis especially designed to take 
into account epistatic effects. 

Design 

We finally briefly investigated the influence of the design of 
the GWA study on its performance. As could be expected, the 
total number n of individuals plays a critical role. It might 
be interesting however to refine this analysis by also consid- 
ering different ratios rti/n . This is left for further investiga- 
tion. We also showed that for a fixed design (fixed number 
of cases and controls) the prevalence of the disease (through 
fo, which is closely related) has an unexpected effect on the 
performance: higher prevalence gives better results. More 
investigations might be necessary to understand the reason 
behind this observation. 

Extensions 

One should note that our approach can be extended in se- 
veral directions. First, we can easily consider more than two 
classes, thus complexifying constraint C. In this case, the re- 
jection and MCMC algorithms remain exactly the same. For 
backward sampling, one should first affect one class against 
the others, and then perform the affectation recursively on 
the remaining unaffected classes. The resulting complexity 
is 0(n x (K — 1)) for the affectation of n individuals into K 
classes. As explained above, we can also consider sophisti- 
cated genetic models that take into account additional cova- 
riates such as environmental exposure or rare variants. Fi- 
nally, the extension of our approach to other fields is quite 
straightforward. For example, in the analysis of gene ex- 
pression, our approach might be used to affect sample status 
(e.g. cancer or healthy) conditional on the level of expres- 
sion rather than generating expressions through parametric 
or non-parametric questionable models as is usually done. 

Conclusions 

In this paper, we present an alternative to classical simu- 
lations under HI in GWA studies. The idea is to generate 
the phenotypes conditional on the existing genotypes with re- 
spect to the study design (number of cases). For that purpose, 
we suggest three algorithms including the backward sampling 
which mimics Hidden Markov Models to provide a fast sam- 
pling conditional on the constraint. Our study shows that 
our algorithms are valid and that their results are consistent 
with reference software such as Hapgen. Moreover, our ap- 
proach has many advantages: it is much faster; it does not 
need any genotyping model (genotypes remain untouched); 
it makes it possible to consider any complex genetic model 



(including several SNPs, epistasis, covariates, rare variants, 
etc.). Our approaches are available in an R package called 
waf f ect ("double-u affect", for weighted affectations). The 
beta version can be downloaded from 
www . math- inf o .univ-paris5 . f r/~vperduca . 
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Appendix 

In this section we give the proofs of the mathematical re- 
sults from the Material and Methods section, namely of equa- 
tions ^ and Recall that Zj = Y?i=i Y i is the 
number of cases among individuals 1, . . . , j and that we deal 
with constraint C = {Z n = n±}. The forward and back- 
ward quantities are the probabilities Fi(m) = P(Zj = m) 
and Bi{m) = V(C\Z t = m). Let V be the set of variables 

{^lj ■ ■ ■ , Y n , Zl) • • - > Z n \. 

The key property is that the conditional dependencies 
among all the variables determine the following factorization 
of the joint probability distribution 

F(V) = l[F(Z j \Z j _ u Y j )F(Y j ), (4) 

where I = {1, . . . , n} is the set of all the individuals and Z = 

by convention. 

Theorem 1. For each individual i = 1, . . . , n: 

P{Z l =m,C) = F l (m)B l (m). (5) 

Moreover 

P(F 4 = 1, Zi = m + 1, = 771, C) = 

F i _i(m)ir i B i (m+ 1), (6) 

¥(Y l = 0,Zi= m, Z t _ 1 = 777, C) = 

F i _ 1 (m)(l-7r i )S i (m). (7) 

Proof. We will only prove Eq. ([5]): similar arguments hold 
for equations (|6]) and ([T]). The marginal probability P(Z, = 
to, C) is obtained from the joint probability distribution P(V) 
multiplying it by the indicator function 1q (by definition lc = 

1 if and only if C is true) and summing out all the variables in 
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V\{Z t } = {Zi,...,Zi-i,Zi + i,...,Z n ,Yi,...,Y n }. Because 
of Eq. Q, we have 

¥(Zi = m,C)= 1, II ^ j (Z j ,Z j (8) 
v\{Zt} jei 

where for convenience we denote Qj(Zj, Zj_x,Yj) = 
FiZjlZ^Y^Yj). 
Now consider the sets 

Vl:i '■= \Zi,Yi, . . . , Zi, Yj,}, 

V(i+l):n '■= {Zi+l,Yi + i, . . . , Z n ,Y n }. 

Vi-i U V(j +1 ) : „ is a partition of V and V \ {Z l } = Vx-.j - {Zi} U 

V(j+l) :n . 

Note that Eq. ([8]) is equal to 



e n % e ^ n 

Vi :i \{2i} l<i<2 i+l<i<n 

e n %V ( e ^ n - . < 

vVi:A{2<} l<i<* / \V( 4+ i)i„ i+l<?'<n 

where the last equality holds because the only variable shared 
between the two factors is Z< which is set to be Zi — m from 
the beginning. It is now easy to see that the two factors in the 
product above are Fi(m) and Bi(m) respectively. □ 

Corollary 2. The equations @ and (0 make it possible to 
compute the forward and backward quantities recursively: 

Fi(m) = Fi_i(m - 1)^ + F,_i(m)(l - m), 
Bi_i(m) = iTiBi(m + 1) + (1 - n^B^m). 

Proof. We will prove only the recursive formula for a si- 
milar argument holds for F». Note that 

P(Z< = m,C) = F 2 (m)7r J+1 B. 1+1 (m + 1) 

+ Fi{m)(l - w i+ x)B i+1 (m) = F^mjB^m) 

By dividing by Fi(m), we obtain the iterative formula for 
the backward quantities. □ 

Corollary 3. We can now prove Eq. 

Bi-i(m) 

/or each individual i = 1, . . . , n. 

Proof. Apply the definition of conditional probability and di- 
vide Eq. QbyEq. ([5]). □ 

Corollary 4. 

P(Y t = l\C) oc F l {m)n l B l {m + 1), 

m 

FQr 1 = 0\C)<x^F i - l (m)(l-n)Bi(m). 
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